In this notebook, we begin our study of Deep Learning. We will use Tensorflow and Keras to build and train Neural Networks (NN) for structured data.
Deep learning is an approach to machine learning characterized by many multi-layered (deep) stacks of computations. This depth of computation is what has enabled deep learning models to understand the complex patterns found in the most challenging real-world datasets.
Some of the most impressive advances in artificial intelligence in recent years have been in the field of deep learning. Natural language translation, image recognition, and game playing are all tasks where deep learning models have neared or even exceeded human-level performance.
Installation:
# Installing or upgrading
# Note: might have to restart kernel
# Uncomment:
# import sys
# Installing:
# !{sys.executable} -m pip install scikit-learn
# Upgrading:
# !{sys.executable} -m pip install --upgrade scipy==1.9.0 --user
Imports:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import tensorflow as tf
import sklearn
print('pandas version:', pd.__version__)
print('numpy version:', np.__version__)
print('tensorflow version:', tf.__version__)
print('scikit-learn version:', sklearn.__version__)
pandas version: 1.5.2
numpy version: 1.23.0
tensorflow version: 2.11.0
scikit-learn version: 1.2.0
Set the plotting style:
try:
    scientific_style = [
        '../../Random/PythonTutorialsForDataScience/data/science.mplstyle',
        '../../Random/PythonTutorialsForDataScience/data/notebook.mplstyle',
        '../../Random/PythonTutorialsForDataScience/data/grid.mplstyle'
    ]
    plt.style.use(scientific_style)
    print('Using Scientific Style.')
except OSError:
    # plt.style.use raises OSError when a style file cannot be found
    print('Missing Scientific Style, continuing with default.')
Using Scientific Style.
Define the filepath where most of the data resides:
path = r'C:\Users\seani\Documents\JupyterNotebooks\Kaggle\KaggleLearn\Assets'
Function used to get names of files in a directory:
import os

def get_files(path):
    '''
    Inputs: a path string
    Returns: a list of names of files in a directory
    '''
    files = []
    # search through each item in the directory
    for file in os.listdir(path):
        # check it is a file
        if os.path.isfile(os.path.join(path, file)):
            files.append(file)
    return files
The seed used throughout for reproducible randomness:
seed = 1
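As a quick sketch of what seeding buys us (using NumPy's `default_rng` API for illustration; the notebook's own code may seed differently), two generators created with the same seed produce identical draws:

```python
import numpy as np

seed = 1  # the notebook's seed

# two generators built from the same seed yield identical "random" draws
rng_a = np.random.default_rng(seed)
rng_b = np.random.default_rng(seed)
draws_match = np.array_equal(rng_a.random(3), rng_b.random(3))
print(draws_match)  # True

# TensorFlow has its own generator; tf.random.set_seed(seed) plays the same role there
```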
Through their power and scalability neural networks have become the defining model of deep learning. Neural networks are composed of neurons, where each neuron individually performs only a simple computation. The power of a neural network comes from the complexity of the connections these neurons can form together.
So let's begin with the fundamental component of a neural network: the individual neuron. As a diagram, a neuron (or unit) with one input looks like:
Figure 1: A single neuron (the linear unit). $x$ is the input, multiplied by the weight $w$, while we have a bias weight $b$ which we multiply by an input with value $1$. The output $y$ is thus: $y = w x + b$.
The input is $x$. Its connection to the neuron has a weight which is $w$. Whenever a value flows through a connection, you multiply the value by the connection's weight. For the input $x$, what reaches the neuron is $w x$. A neural network "learns" by modifying its weights.
The $b$ is a special kind of weight we call the bias. The bias doesn't have any input data associated with it; instead, we put a $1$ in the diagram so that the value that reaches the neuron is just $b$ (since $1 \times b = b$). The bias enables the neuron to modify the output independently of its inputs.
The $y$ is the value the neuron ultimately outputs. To get the output, the neuron sums up all the values it receives through its connections. This neuron's activation is y = w * x + b, or as a formula $y = w x + b$. Notice this is the formula of a line, hence the name linear unit.
We can expand this linear unit to include multiple inputs (but remember we have only one bias). We treat each new input the same, each having their own corresponding weight:
Figure 2: A single neuron with multiple inputs. Each input has a corresponding weight. The output $y$ is thus: $y = w_0 x_0 + w_1 x_1 + w_2 x_2 + b$.
As we can see, the formula for this neuron is now: $y = w_0 x_0 + w_1 x_1 + w_2 x_2 + b$. If the set of weights are $\bar{w}$, and the set of inputs $\bar{x}$, we can denote the output using a dot-product: $y = \bar{w} \cdot \bar{x} + b$ (some sources include the bias in the set of weights, with a corresponding input of $1$).
A linear unit with two inputs will fit a plane, and a unit with more inputs than that will fit a hyperplane.
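A minimal sketch of this formula in NumPy (the weights, inputs, and bias below are made-up values, not learned ones):

```python
import numpy as np

# hypothetical weights, inputs, and bias for a three-input linear unit
w = np.array([0.2, -0.5, 0.1])
x = np.array([1.0, 2.0, 3.0])
b = 0.4

# y = w · x + b
y = np.dot(w, x) + b
print(y)  # 0.2*1 - 0.5*2 + 0.1*3 + 0.4 ≈ -0.1
```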
We want to see how we can implement this Linear Unit as a model. Let's work through an example: the Red Wine Quality dataset, which consists of physiochemical measurements (features) from about $1600$ Portuguese red wines, along with a quality rating (target) for each wine from blind taste-tests. We import the data as follows:
# import the data as pandas DataFrame
red_wine = pd.read_csv('Assets/redwine_data.csv')
# get the shape
print(f'Shape of dataset: {red_wine.shape}')
red_wine
Shape of dataset: (1599, 12)
| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.700 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.99780 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.880 | 0.00 | 2.6 | 0.098 | 25.0 | 67.0 | 0.99680 | 3.20 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.760 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.99700 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.280 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.99800 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.700 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.99780 | 3.51 | 0.56 | 9.4 | 5 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1594 | 6.2 | 0.600 | 0.08 | 2.0 | 0.090 | 32.0 | 44.0 | 0.99490 | 3.45 | 0.58 | 10.5 | 5 |
| 1595 | 5.9 | 0.550 | 0.10 | 2.2 | 0.062 | 39.0 | 51.0 | 0.99512 | 3.52 | 0.76 | 11.2 | 6 |
| 1596 | 6.3 | 0.510 | 0.13 | 2.3 | 0.076 | 29.0 | 40.0 | 0.99574 | 3.42 | 0.75 | 11.0 | 6 |
| 1597 | 5.9 | 0.645 | 0.12 | 2.0 | 0.075 | 32.0 | 44.0 | 0.99547 | 3.57 | 0.71 | 10.2 | 5 |
| 1598 | 6.0 | 0.310 | 0.47 | 3.6 | 0.067 | 18.0 | 42.0 | 0.99549 | 3.39 | 0.66 | 11.0 | 6 |
1599 rows × 12 columns
The easiest way to create a model in Keras is through keras.Sequential, which creates a neural network as a stack of layers. We can create models like those above using a dense layer (which we'll learn more about in the next section). With the first argument, units, we define how many outputs we want. With the second argument, input_shape, we tell Keras the dimensions of the inputs.
We want to predict a wine's perceived quality from the physiochemical measurements, these are our features, and there are $11$ of them (input_shape). We want the output to be the quality, so we have one output (units). The model is created as follows:
from tensorflow import keras
from tensorflow.keras import layers
# linear model so only one unit
# all features total to 11 inputs
model = keras.Sequential([
    layers.Dense(units=1, input_shape=[11]),
])
model
<keras.engine.sequential.Sequential at 0x1e785d59610>
Internally, Keras represents the weights of a neural network with tensors. Tensors are basically TensorFlow's version of a Numpy array with a few differences that make them better suited to deep learning. One of the most important is that tensors are compatible with GPU and TPU accelerators. TPUs, in fact, are designed specifically for tensor computations.
A model's weights and biases are kept in its weights attribute as a list of tensors ([weights, biases]). Notice though that there doesn't seem to be any pattern to the values the weights have. Before the model is trained, the weights are set to random numbers (and the bias to 0.0). A neural network learns by finding better values for its weights.
# weights and biases are contained together
w, b = model.weights
print(f'Weights:\n{w}')
print(f'Bias:\n{b}')
Weights:
<tf.Variable 'dense/kernel:0' shape=(11, 1) dtype=float32, numpy=
array([[ 0.04872853],
[ 0.565458 ],
[-0.18575346],
[ 0.44701165],
[-0.4108805 ],
[ 0.10907167],
[-0.6606643 ],
[ 0.22212785],
[ 0.5111342 ],
[ 0.42094415],
[ 0.6818982 ]], dtype=float32)>
Bias:
<tf.Variable 'dense/bias:0' shape=(1,) dtype=float32, numpy=array([0.], dtype=float32)>
(By the way, Keras represents weights as tensors, but also uses tensors to represent data.)
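A small sketch of moving between the two representations (the values here are arbitrary, chosen only to show the round-trip):

```python
import numpy as np
import tensorflow as tf

# a tensor can be built directly from a Python list (or a NumPy array)
t = tf.constant([[1.0, 2.0], [3.0, 4.0]])
print(t.shape, t.dtype)  # (2, 2) float32

# ...and converted back to a NumPy array when needed
arr = t.numpy()
print(type(arr).__name__)  # ndarray
```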
In this section we're going to see how we can build NNs capable of learning the complex relationships deep NNs are famous for. The key idea here is modularity, building up a complex network from simpler functional units. We've seen how a linear unit computes a linear function, now we'll see how to combine and modify these single units to model more complex relationships.
Neural networks typically organize their neurons into layers. When we collect together linear units having a common set of inputs we get a dense layer. When these are created between the input and output layers they are often called hidden layers:
Figure 3: A dense (hidden) layer of two neurons receiving two inputs and a bias.
You could think of each layer in a neural network as performing some kind of relatively simple transformation. A layer in Keras is a very general kind of thing. A layer can be, essentially, any kind of data transformation. There are many types of layers that can be used.
It turns out, however, that two dense layers with nothing in between are no better than a single dense layer by itself. Dense layers by themselves can never move us out of the world of lines and planes. What we need is something non-linear. What we need are activation functions. This is simply some function we apply to each of a layer's outputs (its activations).
There are many kinds of activation functions, but the most common is the rectifier function $max(0, x)$. The rectifier function has a graph that's a line with the negative part "rectified" to zero. Applying the function to the outputs of a neuron will put a bend in the data, moving us away from simple lines. When we attach the rectifier to a linear unit, we get a rectified linear unit or ReLU. (For this reason, it's common to call the rectifier function the "ReLU function".) Applying a ReLU activation to a linear unit means the output becomes max(0, w * x + b).
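A tiny numeric sketch of a ReLU unit, with a made-up weight and bias:

```python
import numpy as np

# hypothetical weight and bias for a single ReLU unit
w, b = 2.0, -1.0
x = np.array([-1.0, 0.0, 1.0, 2.0])

linear = w * x + b                   # the linear unit: w*x + b
activated = np.maximum(0.0, linear)  # the rectifier clips negatives to zero
print(linear)     # [-3. -1.  1.  3.]
print(activated)  # [0. 0. 1. 3.]
```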
Other functions include the sigmoid, tanh, softmax, etc. They can be called by using keras.layers.Activation('str'), or directly in the Dense layer. Let's look at some supplied by Keras:
fig, axes = plt.subplots(
    1,
    3,
    figsize=(18, 8),
)
# the activation functions looked at
activation_functions = ['relu', 'sigmoid', 'tanh']
# the inputs used
x = tf.linspace(-3.0, 3.0, 100)
# loop through each activation function
for i, function in enumerate(activation_functions):
    # get the activation layer with the current activation function
    activation_layer = layers.Activation(function)
    y = activation_layer(x)  # once created, a layer is callable just like a function
    # horizontal line at zero
    axes[i].axhline(0, color='k', ls='--')
    # plot output of function
    axes[i].plot(
        x,
        y,
    )
    axes[i].set_xlabel(None)
    axes[i].set_ylabel(None)
    axes[i].set_xlim(-3, 3)
    axes[i].set_ylim(None)
    axes[i].set_title(f'{function}')
axes[1].set_xlabel('Input')
axes[0].set_ylabel('Output')
Text(0, 0.5, 'Output')
An example of a stack of dense layers is as follows:
Figure 4: A stack of dense layers to create a NN. The hidden layers contain ReLU activation functions (denoted by the symbol). This results in a linear unit output, without an activation function (see symbol).
Now, notice that the output layer is a linear unit (meaning, no activation function). That makes this network appropriate to a regression task, where we are trying to predict some arbitrary numeric value. Other tasks (like classification) might require an activation function on the output.
Previously, we created a simple linear unit model for the Red Wine dataset. Now let's create a model with three hidden layers, each having 512 units and the ReLU activation:
# creating model
model = keras.Sequential([
    # hidden layers
    layers.Dense(units=512, activation='relu', input_shape=[11]),
    layers.Dense(units=512, activation='relu'),
    layers.Dense(units=512, activation='relu'),
    # output layer (linear unit)
    layers.Dense(units=1),
])
model
<keras.engine.sequential.Sequential at 0x1e789074760>
Note that we could have instead created the dense layers separately from the activation functions, using activation layers. This isn't particularly useful now, but can be for other applications (see section on Batch Normalisation). The above cell can be recreated as follows:
# creating model
model = keras.Sequential([
    # hidden layers
    layers.Dense(units=512, input_shape=[11]),
    layers.Activation('relu'),  # activation acts on the above dense layer
    layers.Dense(units=512),
    layers.Activation('relu'),
    layers.Dense(units=512),
    layers.Activation('relu'),
    # output layer (linear unit)
    layers.Dense(units=1),
])
model
<keras.engine.sequential.Sequential at 0x1e787ff1be0>
In the first two sections, we learned how to build fully-connected networks out of stacks of dense layers. When first created, all of the network's weights are set randomly - the network doesn't "know" anything yet. In this lesson we're going to see how to train a neural network; we're going to see how neural networks learn.
As with all machine learning tasks, we begin with a set of training data. Each example in the training data consists of some features (the inputs) together with an expected target (the output). Training the network means adjusting its weights in such a way that it can transform the features into the target. In addition to the training data, we need two more things:
The loss function measures the disparity between the target's true value and the value the model predicts. During training, the model will use the loss function as a guide for finding the correct values of its weights (lower loss is better).
Different problems call for different loss functions. We have been looking at regression problems, where the task is to predict some numerical value, like the quality in the Red Wine dataset. Other regression tasks might be for example predicting the price of a house or the fuel efficiency of a car.
A common loss function for regression problems is the mean absolute error or MAE. Besides MAE, other loss functions you might see for regression problems are the mean-squared error (MSE) or the Huber loss (both available in Keras).
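Both losses are simple averages, so we can sketch the arithmetic by hand (the targets and predictions below are invented for illustration):

```python
import numpy as np

# invented targets and predictions, just to show the arithmetic
y_true = np.array([5.0, 6.0, 5.0, 7.0])
y_pred = np.array([5.5, 5.0, 5.0, 6.0])

mae = np.mean(np.abs(y_true - y_pred))  # mean absolute error
mse = np.mean((y_true - y_pred) ** 2)   # mean squared error (penalizes big misses more)
print(mae, mse)  # 0.625 0.5625
```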
The optimizer is an algorithm that adjusts the weights to minimize the loss. Virtually all of the optimization algorithms used in deep learning belong to a family called stochastic gradient descent. They are iterative algorithms that train a network in steps. One step of training goes like this:
1. Sample some training data and run it through the network to make predictions.
2. Measure the loss between the predictions and the true values.
3. Finally, adjust the weights in a direction that makes the loss smaller.
Then just do this over and over until the loss is as small as you like (or until it won't decrease any further). Each iteration's sample of training data is called a minibatch (or often just batch), while a complete round of the training data is called an epoch. The number of epochs you train for is how many times the network will see each training example.
The gif below shows the linear model from the first section being trained with SGD:
The pale red dots depict the entire training set, while the solid red dots are the minibatches. Every time SGD sees a new minibatch, it will shift the weights (w the slope and b the y-intercept) toward their correct values on that batch. Batch after batch, the line eventually converges to its best fit. You can see that the loss gets smaller as the weights get closer to their true values.
Notice that the line only makes a small shift in the direction of each batch (instead of moving all the way). The size of these shifts is determined by the learning rate. A smaller learning rate means the network needs to see more minibatches before its weights converge to their best values. The size of the minibatches has a similar effect.
Smaller batch sizes give noisier weight updates and loss curves, because each batch is a small sample of the data and smaller samples give noisier estimates. This noise can sometimes be beneficial, acting as a mild regularizer. Smaller learning rates make each update smaller, so training takes longer to converge. Large learning rates can speed up training, but don't "settle in" to a minimum as well; when the learning rate is too large, the training can fail completely.
Fortunately, for most work it won't be necessary to do an extensive hyperparameter search to get satisfactory results. Adam is an SGD algorithm that has an adaptive learning rate that makes it suitable for most problems without any parameter tuning (it is "self tuning", in a sense). Adam is a great general-purpose optimizer.
The gradient is a vector that tells us in what direction the weights need to go. More precisely, it tells us how to change the weights to make the loss change fastest. We call our process gradient descent because it uses the gradient to descend the loss curve towards a minimum. Stochastic means "determined by chance." Our training is stochastic because the minibatches are random samples from the dataset. And that's why it's called SGD.
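To make the training loop concrete, here is a hand-rolled sketch of SGD on a toy linear model, with the gradients of a mean-squared-error loss worked out analytically. The data and hyperparameters are invented, and Keras does all of this for us in practice:

```python
import numpy as np

rng = np.random.default_rng(0)

# invented toy data from y = 3x + 1 plus a little noise (not the wine data)
x = rng.uniform(-1.0, 1.0, size=64)
y = 3.0 * x + 1.0 + rng.normal(scale=0.1, size=64)

w, b = 0.0, 0.0      # start "untrained"
learning_rate = 0.5

for step in range(100):
    # 1. sample a minibatch of training data
    idx = rng.integers(0, len(x), size=16)
    xb, yb = x[idx], y[idx]
    # 2. make predictions and measure the error against the targets
    err = (w * xb + b) - yb
    # 3. gradients of the mean-squared-error loss with respect to w and b
    grad_w = 2.0 * np.mean(err * xb)
    grad_b = 2.0 * np.mean(err)
    # 4. shift the weights a small step against the gradient
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(w, b)  # close to the true slope 3 and intercept 1
```

Batch after batch, the estimates of $w$ and $b$ wander toward the values that generated the data, just as in the gif above.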
Let's train a NN on the red-wine dataset. The data is prepared below, with some preprocessing (using pandas instead of scikit-learn). Note also that we rescale each feature to lie in the interval $[0, 1]$. As we'll discuss more in section 6, neural networks tend to perform best when their inputs are on a common scale.
# Create training and validation splits
X_train = red_wine.sample(frac=0.7, random_state=0)
X_valid = red_wine.drop(X_train.index) # removes all rows with indices belonging to X_train
# Scale to [0, 1] using the *training* set's statistics
# (scaling the validation set with its own min/max would leak information
# and put the two sets on slightly different scales)
train_max = X_train.max(axis=0)
train_min = X_train.min(axis=0)
X_train = (X_train - train_min) / (train_max - train_min)
X_valid = (X_valid - train_min) / (train_max - train_min)
# Split features and target
y_train = X_train['quality']
y_valid = X_valid['quality']
X_train = X_train.drop('quality', axis=1)
X_valid = X_valid.drop('quality', axis=1)
X_train
| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1109 | 0.548673 | 0.239726 | 0.544304 | 0.092308 | 0.237435 | 0.366197 | 0.212014 | 0.619193 | 0.291262 | 0.260606 | 0.369231 |
| 1032 | 0.309735 | 0.479452 | 0.000000 | 0.246154 | 0.105719 | 0.056338 | 0.028269 | 0.645088 | 0.475728 | 0.121212 | 0.184615 |
| 1002 | 0.398230 | 0.116438 | 0.417722 | 0.088462 | 0.050260 | 0.169014 | 0.074205 | 0.387662 | 0.378641 | 0.309091 | 0.507692 |
| 487 | 0.495575 | 0.359589 | 0.455696 | 0.069231 | 0.032929 | 0.056338 | 0.028269 | 0.619193 | 0.291262 | 0.054545 | 0.246154 |
| 979 | 0.672566 | 0.226027 | 0.620253 | 0.038462 | 0.071057 | 0.028169 | 0.000000 | 0.520183 | 0.252427 | 0.181818 | 0.307692 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 640 | 0.469027 | 0.287671 | 0.569620 | 0.107692 | 0.064125 | 0.211268 | 0.120141 | 0.687738 | 0.504854 | 0.175758 | 0.153846 |
| 104 | 0.230088 | 0.253425 | 0.303797 | 0.100000 | 0.062392 | 0.056338 | 0.106007 | 0.451637 | 0.446602 | 0.090909 | 0.153846 |
| 815 | 0.548673 | 0.226027 | 0.417722 | 0.123077 | 0.112652 | 0.267606 | 0.113074 | 0.617669 | 0.359223 | 0.230303 | 0.369231 |
| 998 | 0.380531 | 0.493151 | 0.430380 | 0.038462 | 0.027730 | 0.042254 | 0.014134 | 0.416603 | 0.242718 | 0.090909 | 0.107692 |
| 1075 | 0.398230 | 0.089041 | 0.430380 | 0.084615 | 0.064125 | 0.619718 | 0.215548 | 0.580350 | 0.553398 | 0.321212 | 0.276923 |
1119 rows × 11 columns
We then create a model with three hidden layers of $512$ neurons each:
Deciding the architecture of your model should be part of a process. Start simple and use the validation loss as your guide.
# create the model
model = keras.Sequential([
    # hidden layers
    layers.Dense(units=512, activation='relu', input_shape=[11]),
    layers.Dense(units=512, activation='relu'),
    layers.Dense(units=512, activation='relu'),
    # output layer
    layers.Dense(units=1),
])
After defining the model, we compile in the optimizer and loss function:
# compile the optimizer and loss function
model.compile(
    optimizer='adam',
    loss='mae',
)
Now we're ready to start the training. We tell Keras to feed the optimizer $256$ rows of the training data at a time (the batch_size) and to do that $10$ times all the way through the dataset (the epochs):
Keras keeps us updated on the loss as the model trains.
# train the model
history = model.fit(
    X_train, y_train,
    validation_data=(X_valid, y_valid),
    batch_size=256,
    epochs=10,
)
Epoch 1/10
5/5 [==============================] - 1s 48ms/step - loss: 0.2834 - val_loss: 0.1447
Epoch 2/10
5/5 [==============================] - 0s 21ms/step - loss: 0.1463 - val_loss: 0.1333
Epoch 3/10
5/5 [==============================] - 0s 21ms/step - loss: 0.1267 - val_loss: 0.1194
Epoch 4/10
5/5 [==============================] - 0s 23ms/step - loss: 0.1177 - val_loss: 0.1075
Epoch 5/10
5/5 [==============================] - 0s 29ms/step - loss: 0.1120 - val_loss: 0.1110
Epoch 6/10
5/5 [==============================] - 0s 30ms/step - loss: 0.1118 - val_loss: 0.1029
Epoch 7/10
5/5 [==============================] - 0s 23ms/step - loss: 0.1080 - val_loss: 0.1102
Epoch 8/10
5/5 [==============================] - 0s 26ms/step - loss: 0.1058 - val_loss: 0.1013
Epoch 9/10
5/5 [==============================] - 0s 46ms/step - loss: 0.1057 - val_loss: 0.1044
Epoch 10/10
5/5 [==============================] - 0s 34ms/step - loss: 0.1023 - val_loss: 0.0989
Often, a better way to view the loss though is to plot it. The fit method keeps a record of the loss produced during training in a History object. We'll convert the data to a dataframe, which makes the plotting easy:
# convert the training history to a dataframe
history_df = pd.DataFrame(history.history)
# use Pandas native plot method
history_df['loss'].plot()
<AxesSubplot:>
Loss is on the y-axis, while epoch is on the x-axis.
Notice how the loss levels off as the epochs go by. When the loss curve becomes horizontal like that, it means the model has learned all it can and there would be no reason to continue for additional epochs.
You might think about the information in the training data as being of two kinds: signal and noise. The signal is the part that generalizes, the part that can help our model make predictions from new data. The noise is all of the random fluctuation that comes from real-world data, and all of the incidental, non-informative patterns that can't actually help the model make predictions. The noise is the part that might look useful but really isn't: it is specific to the dataset we are working on, so whatever noise the model learns holds only for that training data.
Now, the training loss will go down either when the model learns signal or when it learns noise. But the validation loss will go down only when the model learns signal (whatever noise the model learned from the training set won't generalize to new data). So, when a model learns signal both curves go down, but when it learns noise a gap is created in the curves. The size of the gap tells you how much noise the model has learned.
Underfitting the training set is when the loss is not as low as it could be because the model hasn't learned enough signal. Overfitting the training set is when the loss is not as low as it could be because the model learned too much noise. The trick to training deep learning models is finding the best balance between the two.
See section 6: Underfitting and Overfitting of the Introduction to ML notebook for more detail.
A model's capacity refers to the size and complexity of the patterns it is able to learn. For neural networks, this will largely be determined by how many neurons it has and how they are connected together. If it appears that your network is underfitting the data, you should try increasing its capacity.
You can increase the capacity of a network either by making it wider (more units to existing layers) or by making it deeper (adding more layers). Wider networks have an easier time learning more linear relationships, while deeper networks prefer more nonlinear ones. Which is better just depends on the dataset.
For example, the control model for comparison purposes is model, while the wider and deeper models are wider and deeper respectively:
model = keras.Sequential([
    # hidden layers
    layers.Dense(units=16, activation='relu'),
    # output layer
    layers.Dense(units=1),
])

wider = keras.Sequential([
    # hidden layers
    layers.Dense(units=32, activation='relu'),  # more units, so a wider model (wider on diagram)
    # output layer
    layers.Dense(units=1),
])

deeper = keras.Sequential([
    # hidden layers
    layers.Dense(units=16, activation='relu'),
    layers.Dense(units=16, activation='relu'),  # another layer added, so a deeper model (extra hidden layer on diagram)
    # output layer
    layers.Dense(units=1),
])
This concept was briefly mentioned in section 6.2: XGBoosting: Parameter tuning of the Intermediate ML notebook.
When a model is learning too much noise, the validation loss may start to increase during training. To prevent this, we can simply stop the training whenever it seems the validation loss isn't decreasing anymore. Interrupting the training this way is called early stopping. Once we detect that the validation loss is starting to rise again, we can reset the weights back to where the minimum occurred. This ensures that the model won't continue to learn noise and overfit the data.
Training with early stopping also means we're in less danger of stopping the training too early, before the network has finished learning signal. So besides preventing overfitting from training too long, early stopping can also prevent underfitting from not training long enough. Just set your training epochs to some large number (more than you'll need), and early stopping will take care of the rest.
In Keras, we include early stopping in our training through a callback. A callback is just a function you want run every so often while the network trains. The early stopping callback will run after every epoch. (Keras has a variety of useful callbacks pre-defined, but you can define your own, too.)
The function we are discussing is the EarlyStopping() from tensorflow.keras.callbacks, it is passed to the fit method. We will focus on three parameters:
- min_delta: the minimum amount of change in score for us to consider it an improvement
- patience: the number of epochs to wait without improvement before stopping
- restore_best_weights: whether to restore the model to the best weights found; almost always set to True

In plain English, these parameters say: "If there hasn't been an improvement of at least 0.001 in the validation loss over the previous 20 epochs, then stop the training and keep the best model found." We can implement this on the red-wine dataset as follows (using the same data as defined in the previous section):
from tensorflow.keras import callbacks

# define early stopping callback
early_stopping = callbacks.EarlyStopping(
    min_delta=0.001,  # minimum amount of change to count as an improvement
    patience=20,      # how many epochs to wait before stopping
    restore_best_weights=True,
)
# define model
model = keras.Sequential([
    # hidden layers
    layers.Dense(units=512, activation='relu', input_shape=[11]),
    layers.Dense(units=512, activation='relu'),
    layers.Dense(units=512, activation='relu'),
    # output layer
    layers.Dense(units=1),
])
# loss and optimizer
model.compile(
    optimizer='adam',
    loss='mae',
)
# fit model
history = model.fit(
    X_train, y_train,
    validation_data=(X_valid, y_valid),
    batch_size=256,
    epochs=500,
    callbacks=[early_stopping],  # the callbacks go in a list
    verbose=0,  # turns off training log
)
# convert to dataframe for plotting
history_df = pd.DataFrame(history.history)
history_df.loc[:, ['loss', 'val_loss']].plot();
print(f'Minimum validation loss: {history_df["val_loss"].min()}')
Minimum validation loss: 0.09701479971408844
As we can see, Keras stopped the training after only $30$ epochs (not the $500$ we specified!). It is important to note that since we set restore_best_weights=True, the model is now set to the weights that correspond to the minimum of the validation loss, not necessarily the last ones seen in the graph.
There's more to the world of deep learning than just dense layers. There are dozens of kinds of layers you might add to a model (Keras docs on layers). Some are like dense layers and define connections between neurons, and others can do preprocessing or transformations of other sorts.
In this section, we will learn about two kinds of special layers, not containing any neurons themselves, but that add some functionality that can sometimes benefit a model in various ways. Both are commonly used in modern architectures.
The next layer we will look at performs batch normalization (or batchnorm), which can help correct training that is slow or unstable.
With neural networks, it's generally a good idea to put all of your data on a common scale, perhaps with something like scikit-learn's StandardScaler or MinMaxScaler. The reason is that Stochastic Gradient Descent (SGD) will shift the network weights in proportion to how large an activation the data produces. Features that tend to produce activations of very different sizes can make for unstable training behavior.
Now, if it's good to normalize the data before it goes into the network, we can also instead normalize inside the network. A batch normalization layer looks at each batch as it comes in, first normalizing the batch with its own mean and standard deviation, and then also putting the data on a new scale with two trainable rescaling parameters. Batchnorm, in effect, performs a kind of coordinated rescaling of its inputs.
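The two steps can be sketched by hand in NumPy (the batch below is random data on deliberately mismatched scales; a real BatchNormalization layer also tracks running statistics for use at inference time):

```python
import numpy as np

rng = np.random.default_rng(0)
# one batch of activations: 8 samples, 3 features on deliberately mismatched scales
batch = rng.normal(loc=[0.0, 100.0, -5.0], scale=[1.0, 25.0, 0.1], size=(8, 3))

# step 1: normalize with the batch's own mean and standard deviation
mean = batch.mean(axis=0)
std = batch.std(axis=0)
normalized = (batch - mean) / (std + 1e-3)  # small epsilon guards against division by zero

# step 2: rescale with two trainable parameters, gamma and beta
# (initialized to 1 and 0, their usual starting values)
gamma, beta = np.ones(3), np.zeros(3)
out = gamma * normalized + beta

print(out.mean(axis=0))  # each feature is now centred near 0
print(out.std(axis=0))   # ...with standard deviation near 1
```

During training, gamma and beta would be updated by the optimizer along with the network's other weights.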
Most often, batchnorm is added as an aid to the optimization process (though it can sometimes also help prediction performance). Models with batchnorm tend to need fewer epochs to complete training. Moreover, batchnorm can also fix various problems that can cause the training to get "stuck" (this can be seen often when the training gives crazy high losses). Consider adding batch normalization to your models, especially if you're having trouble during training.
Let's do the same analysis as before, but leave out standardising the data, to demonstrate how batch normalization can stabilise the training:
# Create training and validation splits
df_train = red_wine.sample(frac=0.7, random_state=0)
df_valid = red_wine.drop(df_train.index)
# Split features and target
X_train = df_train.drop('quality', axis=1)
X_valid = df_valid.drop('quality', axis=1)
y_train = df_train['quality']
y_valid = df_valid['quality']
Now we add batch normalization using the layers.BatchNormalization() method:
# define model
model = keras.Sequential([
    # hidden layers
    # layer 1
    layers.Dense(units=512, activation='relu', input_shape=[11]),
    # layer 2
    layers.BatchNormalization(),  # adding batch normalization
    layers.Dense(units=512, activation='relu'),
    # layer 3
    layers.BatchNormalization(),
    layers.Dense(units=512, activation='relu'),
    # output layer
    layers.BatchNormalization(),
    layers.Dense(units=1),
])
# loss and optimizer
model.compile(
    optimizer='adam',
    loss='mae',
)
# fit model
history = model.fit(
    X_train, y_train,
    validation_data=(X_valid, y_valid),
    batch_size=256,
    epochs=500,
    callbacks=[early_stopping],  # the callbacks go in a list
    verbose=0,  # turns off training log
)
# convert to dataframe for plotting
history_df = pd.DataFrame(history.history)
history_df.loc[:, ['loss', 'val_loss']].plot();
print(f'Minimum validation loss: {history_df["val_loss"].min()}')
Minimum validation loss: 0.5123989582061768
Typically, we get better performance if we standardize our data before using it for training. That we were able to use the raw data at all, however, shows how effective batch normalization can be on more difficult datasets.
We now look at the Dropout layer, which can help correct overfitting. In the last lesson we talked about how overfitting is caused by the network learning spurious patterns in the training data. To recognize these spurious patterns a network will often rely on very specific combinations of weights. Being so specific, they tend to be fragile: remove one and the combination falls apart.
This is the idea behind dropout. To break up these combinations, we randomly drop out some fraction of a layer's input units every step of training, making it much harder for the network to learn those spurious patterns in the training data. Instead, it has to search for broad, general patterns, whose weight patterns tend to be more robust. So for example, a $50\%$ dropout rate would result in only half the nodes in a layer being active in any given training step:
A $50\%$ dropout has been added onto the second hidden layer, so half of the inputs from the first hidden layer are disconnected at any given training step. Note that although the gif above appears to show nodes being disconnected, what is actually dropped is a different random half of that layer's inputs at each step.
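The masking itself can be sketched in a few lines of numpy (a hypothetical "inverted dropout" sketch, which is how modern frameworks implement it: the surviving units are scaled up so the expected activation is unchanged, and nothing is dropped at inference time):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_sketch(x, rate=0.5):
    # Zero out a random fraction `rate` of the inputs, then scale the
    # survivors by 1/(1 - rate) so the expected sum stays the same
    # ("inverted dropout", applied only during training).
    mask = rng.random(x.shape) >= rate
    return np.where(mask, x / (1 - rate), 0.0)

x = np.ones(10)
out = dropout_sketch(x)
print(out)  # roughly half the entries are 0, the rest are scaled to 2
```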
You could also think about dropout as creating a kind of ensemble of networks, similar to the idea behind Random Forests being an ensemble of Decision Trees. The predictions will no longer be made by one big network, but instead by a committee of smaller networks. Individuals in the committee tend to make different kinds of mistakes, but be right at the same time, making the committee as a whole better than any individual.
We will make an example of this using the spotify dataset. The cell below includes some preprocessing we have not seen before:
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import make_column_transformer
from sklearn.model_selection import GroupShuffleSplit
# import data
spotify = pd.read_csv('Assets/spotify_data.csv')
X = spotify.copy().dropna()
y = X.pop('track_popularity') # target is the popularity
artists = X['track_artist'] # the group we will be splitting by
# split data
def group_split(X, y, group, train_size=0.75):
'''
Returns data split into training and validation based on a particular group.
'''
# splitter used in the next step
splitter = GroupShuffleSplit(train_size=train_size)
# splits data based on the group
# next gets next value in the iterator
train, test = next(splitter.split(X, y, groups=group))
return (X.iloc[train], X.iloc[test], y.iloc[train], y.iloc[test])
X_train, X_valid, y_train, y_valid = group_split(X, y, artists)
# features
numerical_cols = ['danceability', 'energy', 'key', 'loudness', 'mode',
'speechiness', 'acousticness', 'instrumentalness',
'liveness', 'valence', 'tempo', 'duration_ms']
categorical_cols = ['playlist_genre']
# create transformer to process data
preprocessor = make_column_transformer(
(StandardScaler(), numerical_cols),
(OneHotEncoder(), categorical_cols),
)
# preprocessing
X_train = preprocessor.fit_transform(X_train)
X_valid = preprocessor.transform(X_valid)
y_train = y_train / 100 # scale target to [0, 1]
y_valid = y_valid / 100
# logging
input_shape = [X_train.shape[1]]
print(f'Input shape: {input_shape}')
X_train
Input shape: [18]
array([[ 0.64880289, 1.20128975, 0.17342349, ..., 0. ,
0. , 0. ],
[ 0.49762532, 0.64331412, 1.55781275, ..., 0. ,
0. , 0. ],
[ 0.14716822, 1.28415742, -1.21096577, ..., 0. ,
0. , 0. ],
...,
[-0.85610112, 0.67646118, 0.17342349, ..., 0. ,
0. , 0. ],
[-0.18954546, 1.04660343, -0.93408792, ..., 0. ,
0. , 0. ],
[-0.34759474, 1.02450539, -0.10345436, ..., 0. ,
0. , 0. ]])
For comparison purposes, let's first create a model without dropout:
# define model
model = keras.Sequential([
# hidden layers
layers.Dense(units=128, activation='relu', input_shape=input_shape),
layers.Dense(units=64, activation='relu'),
# output layer
layers.Dense(units=1),
])
# loss and optimizer
model.compile(
optimizer='adam',
loss='mae',
)
# fit model
history = model.fit(
X_train, y_train,
validation_data=(X_valid, y_valid),
batch_size=512,
epochs=50,
verbose=0,
)
# convert to dataframe for plotting
history_df = pd.DataFrame(history.history)
history_df.loc[:, ['loss', 'val_loss']].plot()
print(f'Minimum Validation Loss: {history_df["val_loss"].min()}')
Minimum Validation Loss: 0.1921459287405014
We can see that the validation loss quickly diverges while the training loss continues to decrease. This is an indication of overfitting in the model.
Now we will create the same model, but adding dropout using the layers.Dropout() method. The rate parameter controls the fraction of input units to drop:
# define model
model = keras.Sequential([
# hidden layers
# layer 1
layers.Dense(units=128, activation='relu', input_shape=input_shape),
# layer 2
layers.Dropout(rate=0.3), # 30% dropout rate on next layer
layers.Dense(units=64, activation='relu'),
# output layer
layers.Dropout(rate=0.3), # 30% dropout rate on next layer
layers.Dense(units=1),
])
# loss and optimizer
model.compile(
optimizer='adam',
loss='mae',
)
# fit model
history = model.fit(
X_train, y_train,
validation_data=(X_valid, y_valid),
batch_size=512,
epochs=50,
verbose=0,
)
# convert to dataframe for plotting
history_df = pd.DataFrame(history.history)
history_df.loc[:, ['loss', 'val_loss']].plot()
print(f'Minimum Validation Loss: {history_df["val_loss"].min()}')
Minimum Validation Loss: 0.18726733326911926
We can see that the validation loss remains near a constant minimum even though the training loss continues to decrease. So adding dropout did prevent overfitting this time. Moreover, by making it harder for the network to fit spurious patterns, dropout encouraged the network to seek out more of the true patterns, improving the validation loss some as well.
So far in this notebook, we have looked at how neural networks can solve regression problems. Now let's apply neural networks to another common machine learning problem: classification. Most everything we've learned up until now still applies. The main difference is in the loss function we use and in what kind of outputs we want the final layer to produce.
Classification of an input into one of two classes is a common machine learning problem. You might want to predict whether or not a customer is likely to make a purchase, whether or not a credit card transaction was fraudulent, whether deep space signals show evidence of a new planet, or whether a medical test shows evidence of a disease. These are all binary classification problems.
In your raw data, the classes might be represented by strings like "Yes" and "No", or "Dog" and "Cat". Before using this data we will assign a class label: one class will be $0$ and the other will be $1$. Assigning numeric labels puts the data in a form a neural network can use.
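For instance, a string-valued target column can be mapped to numeric labels with pandas (the column and class names here are just illustrative):

```python
import pandas as pd

# A hypothetical string-valued target column
df = pd.DataFrame({'made_purchase': ['Yes', 'No', 'Yes', 'No']})

# Assign numeric class labels: "No" -> 0, "Yes" -> 1
df['made_purchase'] = df['made_purchase'].map({'No': 0, 'Yes': 1})
print(df['made_purchase'].tolist())  # [1, 0, 1, 0]
```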
Accuracy is one of the many metrics in use for measuring success on a classification problem. Accuracy is the ratio of correct predictions to total predictions: $\mathrm{accuracy} = \frac{\mathrm{number\,of\,correct\,predictions}}{\mathrm{total\,predictions}}$. A model that always predicted correctly would have an accuracy score of $1$. All else being equal, accuracy is a reasonable metric to use whenever the classes in the dataset occur with about the same frequency.
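Computed directly, the accuracy formula above is just the mean of the correct-prediction indicator (a small numpy illustration with made-up labels):

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0])  # actual class labels
y_pred = np.array([1, 0, 0, 1, 0])  # model's predicted labels

# accuracy = number of correct predictions / total predictions
accuracy = (y_true == y_pred).mean()
print(accuracy)  # 0.8, i.e. 4 of 5 predictions are correct
```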
The problem with accuracy (and most other classification metrics) is that it can't be used as a loss function. Stochastic Gradient Descent needs a loss function that changes smoothly, but accuracy, being a ratio of counts, changes in "jumps". So, we have to choose a substitute to act as the loss function. This substitute is the cross-entropy function.
Now, recall that the loss function defines the objective of the network during training. With regression, our goal was to minimize the distance between the expected outcome and the predicted outcome. We chose MAE to measure this distance.
For classification, what we want instead is a distance between predicted probability of a given class being the correct one, and whether it is actually the correct class (should measure $1$ if it is, $0$ if not).
Figure 5: The cross-entropy conversion from the network's certainty on the correct class to loss. The network produces a predicted probability for each class: the probability that the class describes the input. If the network predicts a probability of $0.4$ for the actual correct class, it will have a loss of approximately $1$.
The technical reasons we use cross-entropy are a bit subtle, but the main thing to take away from this section is just this: use cross-entropy for a classification loss; other metrics you might care about (like accuracy) will tend to improve along with it.
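As a sanity check of the figure's claim, binary cross-entropy can be computed by hand (a minimal numpy version; Keras' implementation additionally clips probabilities away from $0$ and $1$ for numerical stability):

```python
import numpy as np

def binary_crossentropy(y_true, p_pred):
    # Loss is -log(p) when the true class is 1,
    # and -log(1 - p) when the true class is 0.
    return -(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred))

# Predicting probability 0.4 for the correct class gives a loss near 1:
loss = binary_crossentropy(1, 0.4)
print(loss)  # ~0.92
```

Note how the loss varies smoothly with the predicted probability, which is exactly the property gradient descent needs and accuracy lacks.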
The cross-entropy and accuracy functions both require probabilities as inputs, meaning numbers from $0$ to $1$. To convert the real-valued outputs produced by a dense layer into probabilities, we attach the sigmoid activation function (others can be used of course).
To get the final class prediction, we define a threshold probability. Typically this will be $0.5$, so that rounding will give us the correct class ($B$ for this example): below $0.5$ means class $A$ and $0.5$ or above means class $B$. A $0.5$ threshold is what Keras uses by default with its accuracy metric.
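These two steps, sigmoid then threshold, are easy to sketch in numpy (the logit values below are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    # Squashes any real number into the range (0, 1).
    return 1 / (1 + np.exp(-z))

logits = np.array([-2.0, -0.1, 0.3, 4.0])  # raw outputs of a dense layer
probs = sigmoid(logits)

# 0.5 threshold, as Keras' accuracy metric uses by default:
classes = (probs >= 0.5).astype(int)
print(classes)  # [0 0 1 1]
```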
We will now look at an example using the Hotel Cancellations dataset. We import it as follows:
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
# read data
hotel = pd.read_csv('Assets/hotel_data.csv')
# get features and target
X = hotel.copy()
y = X.pop('is_canceled') # target is whether the stay was cancelled
# map the months onto numbers
X['arrival_date_month'] = X['arrival_date_month'].map(
{
'January':1, 'February': 2, 'March':3,
'April':4, 'May':5, 'June':6, 'July':7,
'August':8, 'September':9, 'October':10,
'November':11, 'December':12,
}
)
# features
numerical_cols = [
"lead_time", "arrival_date_week_number",
"arrival_date_day_of_month", "stays_in_weekend_nights",
"stays_in_week_nights", "adults", "children", "babies",
"is_repeated_guest", "previous_cancellations",
"previous_bookings_not_canceled", "required_car_parking_spaces",
"total_of_special_requests", "adr",
]
categorical_cols = [
"hotel", "arrival_date_month", "meal",
"market_segment", "distribution_channel",
"reserved_room_type", "deposit_type", "customer_type",
]
# transform features pipelines
numerical_transformer = make_pipeline(
SimpleImputer(strategy="constant"), # there are a few missing values
StandardScaler(),
)
categorical_transformer = make_pipeline(
SimpleImputer(strategy="constant", fill_value="NA"),
OneHotEncoder(handle_unknown='ignore'),
)
# combine them into one transformer
preprocessor = make_column_transformer(
(numerical_transformer, numerical_cols),
(categorical_transformer, categorical_cols),
)
# split data
# stratify - makes sure classes are appropriately represented across splits
X_train, X_valid, y_train, y_valid = train_test_split(
X, y,
stratify=y,
train_size=0.75
)
# transform data
X_train = preprocessor.fit_transform(X_train)
X_valid = preprocessor.transform(X_valid)
# logging
input_shape = [X_train.shape[1]]
print(f'Input Shape: {input_shape}')
X_train
Input Shape: [63]
array([[-0.40089967, 1.67390963, -0.89034165, ..., 0. ,
1. , 0. ],
[ 0.96836046, -1.19145173, 0.02203711, ..., 0. ,
1. , 0. ],
[-0.44779214, -1.77921816, 0.36417915, ..., 0. ,
1. , 0. ],
...,
[-0.9729878 , -1.63227655, -1.57462572, ..., 0. ,
1. , 0. ],
[-0.3540072 , 1.45349722, 0.36417915, ..., 0. ,
0. , 0. ],
[ 1.27785076, 0.71878917, -0.66224696, ..., 0. ,
0. , 1. ]])
We create a model using Batch Normalization and Dropout:
# define model
model = keras.Sequential([
# hidden layers
# layer 1
layers.BatchNormalization(input_shape=input_shape),
layers.Dense(units=256, activation='relu'),
# layer 2
layers.BatchNormalization(),
layers.Dropout(rate=0.3),
layers.Dense(units=256, activation='relu'),
# output layer
layers.BatchNormalization(),
layers.Dropout(rate=0.3),
layers.Dense(units=1, activation='sigmoid'),
])
Now compile the model with the Adam optimizer and binary versions of the cross-entropy loss and accuracy metric. Then perform the usual fitting and plotting:
# loss, metrics, and optimizer
model.compile(
optimizer='adam',
loss='binary_crossentropy',
metrics=['binary_accuracy'],
)
# early stopping callback
early_stopping = callbacks.EarlyStopping(
min_delta=0.001,
patience=10,
restore_best_weights=True,
)
# fit the model
history = model.fit(
X_train, y_train,
validation_data=(X_valid, y_valid),
batch_size=512,
epochs=200,
callbacks=[early_stopping],
verbose=0, # no logging
)
# convert to dataframe for plotting
history_df = pd.DataFrame(history.history)
history_df.loc[:, ['loss', 'val_loss']].plot(title="Cross-entropy")
history_df.loc[:, ['binary_accuracy', 'val_binary_accuracy']].plot(title="Accuracy")
<AxesSubplot:title={'center':'Accuracy'}>
The top figure is the loss (cross-entropy), while the bottom figure is the accuracy. The accuracy rose at the same rate as the cross-entropy fell, so it appears that minimizing cross-entropy was a good stand-in. All in all, it looks like this training was a success!
This is a notebook attached to the Kaggle course. It details how to detect the Higgs boson in the very large 2012 dataset from the Large Hadron Collider. It uses TPUs to process that much data, so it is not covered here. It is still recommended to look at the notebook, as it has some good practices and insights into heavier deep learning applications.